An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of these datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs or tumors and cannot be extended to novel domains and classes. To tackle these limitations, we introduce embeddings learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models, resulting in what we dub the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of the CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
translated by 谷歌翻译
Long-document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis: a long document may cover multiple topics. This maximizes their structural heterogeneity and poses a granular-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different fine granularities and then applies multi-granular aligned distillation only during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, where it shows state-of-the-art performance.
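The abstract does not state which distillation objective FGD uses, so the following is only a minimal sketch of one common choice: a KL divergence between the cross-encoder (teacher) and retriever (student) score distributions, averaged over granularities. The function names and the granularity keys ("document", "passage") are hypothetical, introduced here for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over one query's candidate scores.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    # KL(teacher || student) over the same candidate list.
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_granular_loss(teacher_by_level, student_by_level):
    # Assumption: each granularity contributes equally; the paper may weight them differently.
    losses = [
        kl_distillation_loss(teacher_by_level[level], student_by_level[level])
        for level in teacher_by_level
    ]
    return sum(losses) / len(losses)
```

Because the loss is only needed to compute gradients, a term like this can be dropped at inference time, which is consistent with the abstract's statement that distillation is applied during training only.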
Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of the medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provides purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.
Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. Transformers reformat the image into separate patches and realize global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and its loss can lead to sub-optimal performance when dealing with large numbers of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods are neither robust nor efficient for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissue structures. Inspired by the nested hierarchical structures in vision transformers, we propose a novel 3D medical image segmentation method (UNesT), employing a simplified and faster-converging transformer encoder design that achieves local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets, covering 133 structures in the brain, 14 organs in the abdomen, 4 hierarchical components in the kidney, and inter-connected kidney tumors. We show that UNesT consistently achieves state-of-the-art performance and evaluate its generalizability and data efficiency. In particular, the model performs whole-brain segmentation over the complete ROI with 133 tissue classes in a single network and outperforms the prior state-of-the-art method SLANT27, which ensembles 27 network tiles: it increases the mean DSC score on the publicly available Colin and CANDI datasets from 0.7264 to 0.7444 and from 0.6968 to 0.7025, respectively.
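For readers unfamiliar with the DSC metric quoted above, here is a minimal sketch of the Dice similarity coefficient for multi-class label maps. It operates on flat lists of integer voxel labels; the function names are hypothetical, and how the paper averages across classes and subjects is not stated in the abstract, so the plain per-class mean below is an assumption.

```python
def dice_score(pred, target, label):
    # DSC = 2 * |P ∩ T| / (|P| + |T|) for one class label.
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    denominator = sum(pred_mask) + sum(target_mask)
    if denominator == 0:
        return None  # label absent from both maps: score is undefined
    return 2.0 * intersection / denominator

def mean_dice(pred, target, labels):
    # Assumption: unweighted mean over the classes that appear in either map.
    scores = [dice_score(pred, target, label) for label in labels]
    scores = [s for s in scores if s is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

A score of 1.0 means the predicted and reference regions coincide exactly, and 0.0 means they do not overlap at all.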
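This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language that effectively decomposes visual modeling into spatial and temporal dimensions and yields a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without extra task-specific adapters, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.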
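The abstract does not give the form of the UniVLC loss. As background only, the sketch below shows the generic symmetric contrastive (InfoNCE-style) objective over paired visual and text embeddings that vision-language contrastive losses commonly build on. It is not the UniVLC loss itself: it does not handle label-supervised data, and the function names and the default temperature are assumptions.

```python
import math

def _log_softmax(row):
    # Stable log-softmax over one row of similarity logits.
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def symmetric_contrastive_loss(visual, text, temperature=0.07):
    # visual[i] and text[i] are L2-normalized embeddings of a matching pair;
    # every other item in the batch acts as a negative.
    n = len(visual)
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text]
        for v in visual
    ]
    visual_to_text = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(column) for column in zip(*sim)]
    text_to_visual = -sum(_log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (visual_to_text + text_to_visual)
```

The loss is small when each visual embedding is most similar to its own text embedding and large when pairs are mismatched.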
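For humans, understanding the relationships between objects using visual signals is intuitive. For artificial intelligence, however, this task remains challenging. Researchers have made significant progress in studying semantic relationship detection, such as human-object interaction detection and visual relationship detection. We take the study of visual relationships a step further, from semantic to geometric. Specifically, we predict relative occlusion and relative distance relationships. However, detecting these relationships from a single image is challenging. Enforcing focused attention on task-specific regions plays a critical role in successfully detecting these relationships. In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on occlusion-specific regions; (3) our model achieves new state-of-the-art performance on distance-aware relationship detection. Specifically, our model increases the distance F1-score from 33.8% to 38.6% and boosts the occlusion F1-score from 34.4% to 41.2%. Our code is publicly available.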
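A ranker plays an indispensable role in the de facto "retrieval and rerank" pipeline, but its training still lags behind: it learns from moderate negatives and/or serves as an auxiliary module for a retriever. In this work, we first identify two major barriers to a robust ranker, namely the inherent label noise caused by a well-trained retriever and the non-ideal negatives sampled for a high-capacity ranker. We therefore propose using multiple retrievers as negative generators to improve the ranker's robustness, where i) involving extensive out-of-distribution label noise makes the ranker robust against each noise distribution, and ii) the resulting negatives are relatively close to the ranker's negative distribution, leading to more challenging training. To evaluate our robust ranker (dubbed R$^2$anker), we conduct experiments in various settings on a popular passage retrieval benchmark, including BM25 reranking, full ranking, retriever distillation, etc. The empirical results verify the new state-of-the-art effectiveness of our model.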
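As a rough illustration of the "multiple retrievers as negative generators" idea, the sketch below pools the top-ranked candidates from several retrievers, removes known positives, and samples training negatives from the union. The pooling and uniform sampling rules are assumptions made here for illustration; the abstract does not specify how negatives are drawn, and the function name is hypothetical.

```python
import random

def sample_negatives(retriever_rankings, positives, num_negatives, rng):
    # retriever_rankings: one ranked list of document ids per retriever.
    # positives: set of ids labeled relevant for the query.
    pool = []
    seen = set()
    for ranking in retriever_rankings:
        for doc_id in ranking:
            if doc_id not in positives and doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    # Assumption: uniform sampling without replacement from the joint pool.
    return rng.sample(pool, min(num_negatives, len(pool)))
```

For example, `sample_negatives([["d1", "d2"], ["d3", "d1"]], {"d1"}, 2, random.Random(0))` draws from the pool `["d2", "d3"]`, so negatives that only one retriever surfaced can still reach the ranker.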
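Reference-based super-resolution (RefSR) has made significant progress in producing realistic textures using an external reference (Ref) image. However, existing RefSR methods obtain high-quality correspondence matching at the cost of computational resources that grow quadratically with the input size, which limits their application. Moreover, these methods usually suffer from scale misalignment between the low-resolution (LR) image and the Ref image. In this paper, we propose an Accelerated Multi-Scale Aggregation network (AMSA) for reference-based super-resolution, comprising Coarse-to-Fine Embedded PatchMatch (CFE-PatchMatch) and a Multi-Scale Dynamic Aggregation (MSDA) module. To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random sample propagation, which supports end-to-end training with a computational cost that is asymptotically linear in the input size. To further reduce the computational cost and speed up convergence, we apply a coarse-to-fine strategy to Embedded PatchMatch, which constitutes CFE-PatchMatch. To fully leverage reference information across multiple scales and enhance robustness to scale misalignment, we develop the MSDA module, consisting of Dynamic Aggregation and Multi-Scale Aggregation. Dynamic Aggregation corrects minor scale misalignment by dynamically aggregating features, and Multi-Scale Aggregation brings robustness to large scale misalignment by fusing multi-scale information. Experimental results show that the proposed AMSA achieves superior performance over state-of-the-art approaches in both quantitative and qualitative evaluations.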
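Non-Local Attention (NLA) brings significant improvements to Single Image Super-Resolution (SISR) by leveraging intrinsic feature correlations in natural images. However, NLA gives noisy information large weights and consumes computational resources that grow quadratically with the input size, limiting its performance and application. In this paper, we propose a novel Efficient Non-Local Contrastive Attention (ENLCA) to perform long-range visual modeling and leverage more relevant non-local features. Specifically, ENLCA consists of two parts: Efficient Non-Local Attention (ENLA) and Sparse Aggregation. ENLA adopts the kernel method to approximate the exponential function and obtains linear computational complexity. For Sparse Aggregation, we multiply the inputs by an amplification factor to focus on informative features, yet the variance of the approximation increases exponentially. Therefore, contrastive learning is applied to further separate relevant and irrelevant features. To demonstrate the effectiveness of ENLCA, we build an architecture called the Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules to a simple backbone. Extensive experimental results show that ENLCN achieves superior performance over state-of-the-art approaches in both quantitative and qualitative evaluations.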
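To make the "kernel method yields linear complexity" step concrete, here is a minimal sketch under stated assumptions. It uses positive random features, for which the expectation of phi(q) · phi(k) equals exp(q · k), and then reorders the matrix products so the n x n attention matrix is never formed. The abstract does not specify ENLA's feature map, so this particular map and all function names are assumptions for illustration.

```python
import math
import random

def make_projections(num_features, dim, seed=0):
    # Rows w_j drawn from a standard normal distribution.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

def feature_map(x, projections):
    # phi(x)_j = exp(w_j . x - |x|^2 / 2) / sqrt(m); always positive.
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm / 2.0)
        for w in projections
    ]

def linear_attention(queries, keys, values, projections):
    # Computes phi(Q) (phi(K)^T V): cost grows linearly with sequence length.
    phi_q = [feature_map(q, projections) for q in queries]
    phi_k = [feature_map(k, projections) for k in keys]
    m, d_v = len(projections), len(values[0])
    kv = [[sum(pk[j] * v[c] for pk, v in zip(phi_k, values)) for c in range(d_v)]
          for j in range(m)]
    k_sum = [sum(pk[j] for pk in phi_k) for j in range(m)]
    out = []
    for pq in phi_q:
        norm = sum(pq[j] * k_sum[j] for j in range(m))
        out.append([sum(pq[j] * kv[j][c] for j in range(m)) / norm for c in range(d_v)])
    return out

def quadratic_attention(queries, keys, values, projections):
    # Same kernel, but materializes every query-key weight: quadratic cost.
    phi_q = [feature_map(q, projections) for q in queries]
    phi_k = [feature_map(k, projections) for k in keys]
    out = []
    for pq in phi_q:
        weights = [sum(a * b for a, b in zip(pq, pk)) for pk in phi_k]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out
```

Both functions return the same result up to floating-point rounding; only the order of operations, and therefore the cost, differs. Because the feature map exponentiates its input, scaling the inputs by a large amplification factor makes the random estimate much noisier, which is in line with the variance issue the abstract mentions.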
Classification using supervised learning requires annotating a large amount of class-balanced data for model training and testing. This has practically limited the scope of applications of supervised learning, in particular deep learning. To address the issues associated with limited and imbalanced data, this paper introduces a sample-efficient co-supervised learning paradigm (SEC-CGAN), in which a conditional generative adversarial network (CGAN) is trained alongside the classifier and supplements the annotated data with semantics-conditioned, confidence-aware synthesized examples during the training process. In this setting, the CGAN not only serves as a co-supervisor but also provides complementary quality examples to aid the classifier training in an end-to-end fashion. Experiments demonstrate that the proposed SEC-CGAN outperforms the external classifier GAN (EC-GAN) and a baseline ResNet-18 classifier. For the comparison, all classifiers in the above methods adopt the ResNet-18 architecture as the backbone. In particular, on the Street View House Numbers dataset, using 5% of the training data, SEC-CGAN achieves a test accuracy of 90.26%, as opposed to 88.59% for EC-GAN and 87.17% for the baseline classifier; on the highway image dataset, using 10% of the training data, SEC-CGAN achieves a test accuracy of 98.27%, compared to 97.84% for EC-GAN and 95.52% for the baseline classifier.
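One simple way to read "confidence-aware synthesized examples" is as a filter: a generated example is added to the training batch only if the classifier assigns high probability to the class the generator was conditioned on. The abstract does not describe the actual selection rule, so the thresholding below, the default threshold, and the function name are all assumptions for illustration.

```python
def select_confident(synthetic_examples, class_probs, conditioned_labels, threshold=0.9):
    # class_probs[i][c]: classifier probability that example i belongs to class c.
    # conditioned_labels[i]: class index the generator was conditioned on.
    kept = []
    for example, probs, label in zip(synthetic_examples, class_probs, conditioned_labels):
        if probs[label] >= threshold:
            kept.append((example, label))
    return kept
```

Under this reading, the kept pairs would be appended to the annotated examples in each training step, so that low-quality generations early in training contribute little to the classifier's updates.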